Abstract

The Astrée static analyzer is a specialized tool that can prove the
absence of runtime errors, including arithmetic overflows, in large critical
programs. Keeping analysis times reasonable for industrial use is one of
the design objectives. In this paper, we discuss the parallel implementation
of the analysis.

1 Introduction

The Astrée static analyzer¹ is a tool that analyzes, fully automatically,
single-threaded programs written in a subset of the C programming language,
sufficient for many typical critical embedded programs. The tool particularly
targets control/command applications using many floating-point variables and
numerical filters, though it has been successfully applied to other categories
of software. It computes a super-set of the possible run-time errors. Astrée is
designed for efficiency on large software: hundreds of thousands of lines of
code are analyzed in a matter of hours, while producing very few false alarms.
For example, some fly-by-wire avionics reactive control codes (70 000 and
380 000 lines respectively, the latter of a much more complex design) are
analyzed in 1 h and 10 h 30 min respectively on current single-CPU PCs, with no
false alarm [1][2][H].

Other contributions [Till EH E3 [13 [THl [T3] have described the abstract
domains used in Astrée; that is, the data structures and algorithms implementing
the symbolic operations over abstract sets of reachable states in the program to
be analyzed. However, the operations in these abstract domains must be driven
by an iterator, which follows the control flow of the program to be analyzed
and calls the necessary operations. This paper describes some characteristics
of the iterator. We first explain some peculiarities of our iteration algorithm
as well as some implementation techniques regarding efficient shared data
structures. These have an impact on the main contribution of the paper, which is
the parallelization technique implemented in Astrée.



1 http://www.astree.ens.fr

Even though Astrée already performed well enough on single-processor systems
to be used in practical settings on large-scale industrial code, we designed a
parallel implementation suitable both for shared-memory multiprocessor systems
and for small clusters of PCs over local area networks. Since Astrée focuses on
synchronous, statically scheduled reactive programs, we exploited that
particular form of the program to be analyzed in order to design a very simple,
yet efficient, parallelization scheme. We show, however, that the control-flow
properties that enable such parallel analysis are also found in other kinds of
programs, including major classes such as event-driven user interfaces.

Section 2 describes the overall structure of the interpreter and the most
significant choices about the iteration strategy. This defines the framework
within which we implement our parallelization scheme. We also discuss
implementation choices for some data structures, which have a large impact on
the simplicity and efficiency of the parallel implementation.

Section 3 describes the parallelization of the abstract interpreter in a range
of practical cases.

2 The Astrée Abstract Interpreter

Our static analyzer is structured in several hierarchical layers: 

• a denotational-based abstract interpreter abstractly executes the
instructions in the programs by sending orders to the abstract domains;

• a partitioning domain [12] handles the partitioning of traces depending on
various criteria; it also operates the partitioning with respect to the call
stack;

• a branching abstract domain handles forward branching constructs such
as forward goto, break, continue;

• a structure abstract domain resolves all accesses to complex data structures
(arrays, pointers, records...) into may- or must-aliases over abstract cells
[H §6.1];

• various numerical domains express different kinds of constraints over those
cells; each of these domains can query other domains for information, and send
information to those domains (reduction).

2.1 A denotational-based interpreter 

Contrary to some presentations or examples of abstract interpretation-based
static analysis [7], we did not choose to obtain results through the direct
resolution (or over-approximation) of a system of semantic equations, but rather
to follow the denotational semantics of the program as in [6, Sect. 13].
Consider the following fragment of the C programming language:
$$
\begin{array}{rcll}
l & ::= & x \mid t[e] \mid \dots & \text{l-values} \\
e & ::= & l \mid e \oplus e \mid \ominus e \mid \dots & \text{expressions } (\oplus \in \{+, *, \dots\};\ \ominus \in \{-, \dots\}) \\
s & ::= & \tau\ x; \mid l = e \mid \texttt{if}(e)\{s; \dots; s\}\texttt{else}\{s; \dots; s\} \mid \texttt{while}(e)\{s; \dots; s\} &
\end{array}
$$

$\mathbb{L}$ is the set of control states, $\tau$ is any type, $\mathcal{L}$
(resp. $\mathcal{E}$) is the set of l-values (resp. expressions). The concrete
semantics of a statement $s$ is a function
$[\![s]\!] : \mathbb{M} \to \mathcal{P}(\mathbb{M}) \times \mathcal{P}(\mathbb{E})$
where $\mathbb{M}$ (resp. $\mathbb{E}$) is the set of memory states (resp.
errors). Given an abstraction of sets of stores defined by an abstract domain
$D^\sharp_{\mathbb{M}}$ and a concretization function
$\gamma_{\mathbb{M}} : D^\sharp_{\mathbb{M}} \to \mathcal{P}(\mathbb{M})$, we can
derive an approximate abstract semantics
$[\![P]\!]^\sharp : D^\sharp_{\mathbb{M}} \to D^\sharp_{\mathbb{M}}$ of program
fragment $P$ by following the methodology of abstract interpretation [5].

The soundness of $[\![P]\!]^\sharp$ can be stated as follows: if
$[\![P]\!](\rho) = (m, e_0)$ and $\rho \in \gamma_{\mathbb{M}}(d^\sharp)$ then
$m \subseteq \gamma_{\mathbb{M}}([\![P]\!]^\sharp(d^\sharp))$ (resp. for the
error list). The principle of the interpreter is to compute
$[\![P]\!]^\sharp(d^\sharp)$ by induction on the syntax of $P$.
$D^\sharp_{\mathbb{M}}$ should provide abstract counterparts (assign, new_var,
del_var, guard) to the concrete operations (assignment, variable creation and
deletion, condition test). For instance, assign should be a function in
$\mathcal{L} \times \mathcal{E} \times D^\sharp_{\mathbb{M}} \to D^\sharp_{\mathbb{M}}$
that inputs an l-value, an expression and an abstract value and returns a sound
over-approximation of the set of stores resulting from the assignment:
$\forall l \in \mathcal{L}, \forall e \in \mathcal{E}, \forall d^\sharp \in D^\sharp_{\mathbb{M}},\ \{\rho[[\![l]\!](\rho) \mapsto [\![e]\!](\rho)] \mid \rho \in \gamma_{\mathbb{M}}(d^\sharp)\} \subseteq \gamma_{\mathbb{M}}(\mathit{assign}(l, e, d^\sharp))$.
Soundness conditions for the other operations (guard, new_var, del_var) are
similar.

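$$
\begin{array}{rcl}
[\![l = e;]\!]^\sharp(d^\sharp) & = & \mathit{assign}(l, e, d^\sharp) \quad \text{where } l \in \mathcal{L},\ e \in \mathcal{E} \\
[\![\{\tau\ x; s\}]\!]^\sharp(d^\sharp) & = & \mathit{del\_var}(\tau\ x, [\![s]\!]^\sharp(\mathit{new\_var}(\tau\ x, d^\sharp))) \\
[\![\texttt{if}(e)\ s_0\ \texttt{else}\ s_1;]\!]^\sharp(d^\sharp) & = & [\![s_0]\!]^\sharp(\mathit{guard}(e, t, d^\sharp)) \sqcup [\![s_1]\!]^\sharp(\mathit{guard}(e, f, d^\sharp)) \\
[\![\texttt{while}(e)\ s]\!]^\sharp(d^\sharp) & = & \mathit{guard}(e, f, \mathrm{lfp}^\sharp f^\sharp)
\end{array}
$$

where $f^\sharp : x^\sharp \in D^\sharp_{\mathbb{M}} \mapsto d^\sharp \sqcup [\![s]\!]^\sharp(\mathit{guard}(e, t, x^\sharp))$.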

The function $\mathrm{lfp}^\sharp$ computes a post-fixpoint of any abstract
function (i.e., an approximation of the concrete least fixpoint). While the
actual scheme implemented is somewhat complex, it is sufficient to say that
$\mathrm{lfp}^\sharp f^\sharp$ outputs some $x^\sharp$ such that
$f^\sharp(x^\sharp) \sqsubseteq^\sharp x^\sharp$ for some decidable ordering
$\sqsubseteq^\sharp$ such that
$\forall x^\sharp, y^\sharp,\ x^\sharp \sqsubseteq^\sharp y^\sharp \Rightarrow \gamma(x^\sharp) \subseteq \gamma(y^\sharp)$.
This abstract fixpoint is sought by the iterator in "iteration mode": possible
warnings that could occur within the code are not displayed when encountered.
Then, once $L^\sharp = \mathrm{lfp}^\sharp f^\sharp$ (an invariant for the loop
body) is computed, the iterator analyzes the loop body again and displays
possible warnings. As a supplemental safety measure, we check again that
$f^\sharp(L^\sharp) \sqsubseteq^\sharp L^\sharp$.²

$\sqcup$ is an abstraction of the concrete union $\cup$:
$\forall d_1^\sharp, d_2^\sharp,\ \gamma(d_1^\sharp) \cup \gamma(d_2^\sharp) \subseteq \gamma(d_1^\sharp \sqcup d_2^\sharp)$.
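To make the syntax-directed scheme concrete, here is a minimal Python sketch under simplifying assumptions that are not taken from Astrée: a toy interval domain with one interval per variable, conditions restricted to `x < c` on integer variables, and a plain widening iterated until a post-fixpoint is reached. All names (`join`, `leq`, `widen`, `guard_lt`, `lfp`, `analyze`) and the tuple encoding of statements are hypothetical.

```python
import math

# An abstract state maps each variable to an interval (lo, hi);
# None stands for bottom (no reachable state).

def join(a, b):
    """Abstract union: interval hull, variable by variable."""
    if a is None:
        return b
    if b is None:
        return a
    return {v: (min(a[v][0], b[v][0]), max(a[v][1], b[v][1])) for v in a}

def leq(a, b):
    """Decidable ordering: every interval of a is included in that of b."""
    if a is None:
        return True
    if b is None:
        return False
    return all(b[v][0] <= a[v][0] and a[v][1] <= b[v][1] for v in a)

def widen(a, b):
    """Widening: a bound that is not stable jumps to infinity."""
    if a is None:
        return b
    if b is None:
        return a
    return {v: (a[v][0] if a[v][0] <= b[v][0] else -math.inf,
                a[v][1] if a[v][1] >= b[v][1] else math.inf) for v in a}

def guard_lt(x, c, truth, d):
    """guard(x < c, truth, d) for an integer variable x and a constant c."""
    if d is None:
        return None
    lo, hi = d[x]
    lo, hi = (lo, min(hi, c - 1)) if truth else (max(lo, c), hi)
    return None if lo > hi else {**d, x: (lo, hi)}

def lfp(f, d):
    """Return some x such that join(d, f(x)) is below x, using widening."""
    x = d
    while True:
        nxt = join(d, f(x))
        if leq(nxt, x):
            return x
        x = widen(x, nxt)

def analyze(stmt, d):
    """Abstract semantics of stmt on input d, by induction on the syntax."""
    if d is None:
        return None
    kind = stmt[0]
    if kind == "set":                   # x = c
        _, x, c = stmt
        return {**d, x: (c, c)}
    if kind == "add":                   # x = x + c
        _, x, c = stmt
        return {**d, x: (d[x][0] + c, d[x][1] + c)}
    if kind == "seq":                   # s; ...; s
        for s in stmt[1]:
            d = analyze(s, d)
        return d
    if kind == "if_lt":                 # if (x < c) s0 else s1
        _, x, c, s0, s1 = stmt
        return join(analyze(s0, guard_lt(x, c, True, d)),
                    analyze(s1, guard_lt(x, c, False, d)))
    if kind == "while_lt":              # while (x < c) s
        _, x, c, body = stmt
        inv = lfp(lambda y: analyze(body, guard_lt(x, c, True, y)), d)
        return guard_lt(x, c, False, inv)
    raise ValueError(kind)
```

On `x = 0; while (x < 10) x = x + 1;` this sketch returns the sound but coarse exit interval [10, +∞) for x, since it performs no narrowing after the widening.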

An abstract domain handles the call stack; currently in Astrée, it amounts
to partitioning states by the full calling context [12]. Astrée does not handle
recursive functions³; this is not a problem with critical embedded code, since
programming guidelines for such systems generally prohibit the use of recursive
functions. Functions are analyzed as if they were inlined at the point of call.
Multiple targets for function pointers are analyzed separately, and the results
merged with $\sqcup$; see Sect. 3 for an application to parallelization.

Other forms of branches are dealt with by an extension of the abstract
semantics. Explicit gotos are rarely used in C, except as forward branches to
error handlers or for exiting multiple loops; however, semantically similar
branching structures are very common and include case structures, break
statements and return statements. Indeed, a return statement return e carries
out two operations: first, it evaluates e and stores the value as the function
result; then, it terminates the current function, i.e. branches to the end of
the function. In this paper, we only consider the case of forward-branching
gotos; the other constructs are then straightforward extensions.

2 Let us note that the computationally costly part of the analysis is finding
the loop invariant, rather than checking it. P. Cousot suggested the following
improvement over our existing analysis: using different implementations for
finding the invariant and checking it (at present, the same program does both).
For instance, the checking phase could be a possibly less efficient version,
whose safety would be formally proved. However, since all abstract domains
and most associated algorithms would have to be implemented in that "safe"
analyzer, the amount of work involved would be considerable and we have not done
it at this point. Also, as discussed in Sect. 3.2, both implementations would
have to yield identical results, which means that the "safe" analysis would have
to mimic the "unsafe" one in detail.

3 More precisely, it can analyze recursive programs, but analysis may fail to
terminate. If analysis terminates, then its results are sound.

We extend the syntax of statements with a goto statement goto $l$ where $l$ is
a program point (we implicitly assume that there is a label before each
statement in the program). The execution of a statement $s$ may yield either a
new memory state or a branching to a point after $s$. Therefore, we lift the
definition of the semantics into a function
$[\![s]\!] : (\mathbb{M} \times (\mathbb{L} \to \mathcal{P}(\mathbb{M}))) \to (\mathcal{P}(\mathbb{M}) \times (\mathbb{L} \to \mathcal{P}(\mathbb{M})) \times \mathbb{E})$.
The concrete states before and after each statement no longer consist solely of
a set of memory states, but of a set of memory states for the "direct" control
flow as well as a set of memory states for each label $l$, representing all the
memory states that have occurred in the past for which a forward branch to $l$
was requested.

The concrete semantics of goto $l$ is defined by:
$[\![\texttt{goto } l;]\!](I_i, \phi_i) = (\bot, \phi_i[l \mapsto \phi_i(l) \cup I_i])$
and the concrete semantics of a statement $s$ at label $l$ is defined from the
semantics without branches as:
$[\![l : s;]\!](I_i, \phi_i) = (\{I_i\} \cup \phi_i(l), \phi_i)$.
The definition of the abstract semantics can be extended in the same way. We
straightforwardly lift the abstract semantics of a statement $s$ into a function
$[\![s]\!]^\sharp : D^\sharp_{\mathbb{M}} \times (\mathbb{L} \to D^\sharp_{\mathbb{M}}) \to D^\sharp_{\mathbb{M}} \times (\mathbb{L} \to D^\sharp_{\mathbb{M}})$.
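The bookkeeping for forward branches can be sketched as follows. This is only an illustration under stated assumptions: sets of concrete integer values stand in for abstract values (so the union is plain set union and bottom is the empty set), and the step encoding and the names `goto`, `at_label` and `run` are hypothetical.

```python
BOTTOM = frozenset()   # no reachable state

def goto(label, direct, pending):
    """goto label: the current states are parked under label; direct flow stops."""
    parked = dict(pending)
    parked[label] = parked.get(label, BOTTOM) | direct
    return BOTTOM, parked

def at_label(label, direct, pending):
    """Statement at label: states parked by earlier gotos rejoin the direct flow."""
    return direct | pending.get(label, BOTTOM), pending

def run(program, direct):
    """Interpret a list of steps:
    ("apply", fn)            apply fn to every state of the direct flow
    ("goto_if", pred, label) states satisfying pred branch forward to label
    ("label", label)         program point targeted by earlier gotos
    """
    pending = {}
    for step in program:
        if step[0] == "apply":
            direct = frozenset(step[1](v) for v in direct)
        elif step[0] == "goto_if":
            _, pred, label = step
            taken = frozenset(v for v in direct if pred(v))
            _, pending = goto(label, taken, pending)
            direct = direct - taken
        elif step[0] == "label":
            direct, pending = at_label(step[1], direct, pending)
        else:
            raise ValueError(step[0])
    return direct
```

Because only forward branches are allowed, a single left-to-right pass suffices: every state parked under a label has been recorded before that label is reached.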

2.2 Rationale and efficiency issues 

The choice of the denotational approach was made for two reasons:

• Iteration and widening techniques on general graph representations of
programs are more complex. Essentially, these techniques partly have to
reconstruct the natural control flow of the program so as to obtain an
efficient propagation flow [3][TT]. Since our programs are block-structured and
do not contain backward gotos, this flow information is already present
in their syntax; there is no need to reconstruct it. Sect. 13] 

• It minimizes the amount of memory used for storing the abstract environ- 
ments. While our storage methods maximize the sharing between abstract 
environments, our experiments showed that storing an abstract environ- 
ment for each program point (or even each branching point) in main mem- 
ory was too costly. Good forward/backward iteration techniques do not 
need to store environments at that many points, but this measurement 
still was an indication that there would be difficulties in implementing 
such schemes. 

We measured the memory required for storing the local invariants at part or 
all of the program points, for three industrial control programs representative 
of those we are interested in; see the table below. We performed several mea- 
surements, depending on whether invariant data was saved at all statements, 
at the beginning and end of each block, and at the beginning and end of each 
function. 

For each program and measurement, we provide two figures: from left to 
right, the peak memory observed during the analysis, then the size of the serial- 
ized invariant (serialization is performed for saving to files or for parallelization 
purposes, and preserves the sharing property of the internal representation). 






Benchmarks (see below) show that keeping local invariants at the boundaries 
of every block in main memory is not practical on large programs; even restrict- 
ing to the boundaries of functions results in a major overhead. A database 
system for storing invariants on secondary storage could be an option, but Brat 
and Venet have reported significant difficulties and complexity with that ap- 
proach [3]. Furthermore, such an approach would complicate memory sharing, 
and perhaps force the use of solutions such as "hash-consing", which we have
avoided so far.

Memory requirements are expressed in megabytes; analyses were run in 64-bit
mode on a dual Opteron 2.2 GHz machine with 8 GB RAM.⁴ On many occasions, we
had to abort the computation due to large memory requirements causing the
system to swap.





                                        Program 1        Program 2        Program 3
                                        peak    ser.     peak    ser.     peak    ser.
# of lines of C code                    67,553           232,859          412,858
# of functions                          650              1,900            2,900
Save at all statements                  3300    688      > 8000  swap     > 8000  swap
Save at beginning / end of blocks       2300    463      > 8000  swap     > 8000  swap
Save at beginning / end of functions    690     87       2480    264      4800    428
Save main loop invariant only           415     15       1544    53       2477    96
No save                                 410              1544             2440

(peak = peak memory observed during the analysis; ser. = size of the serialized
invariant; both in megabytes.)

Benchmarks courtesy of X. Rival.



Memorizing invariants at the head of loops (the least set of invariants we 
can keep so as to be able to compute widening chains) thus entails much smaller 
memory requirements than a naive graph-based implementation; the latter is 
intractable on reasonably-priced current computers on the class of large pro- 
grams that we are interested in. It is possible that more complex memorization 
schemes may make graph-based algorithms tractable, but we did not investigate 
such schemes because we had an efficient and simple working system. 

Regarding efficiency, it soon became apparent that a major factor was the
efficiency of the $\sqcup$ operation. In a typical program, the number of tests
will be roughly linear in the length of the code. In the control programs that
Astrée targets, the number of state variables (the values of which are kept
across iterations) is also roughly linear in the length $l$ of the code. This
means that if the $\sqcup$ operation takes linear time in the number of
variables (an apparently good complexity), an iteration of the analyzer takes
$\Theta(l^2)$ time, which is prohibitive. We therefore argue that what matters
is the complexity of $\sqcup$ with respect to the number of updated variables,
which should be almost linear: if only $n_1$ (resp. $n_2$) variables are touched
in the if branch (resp. else branch), then the overall complexity should be at
most roughly $O(n_1 + n_2)$.

We achieve such complexity with our implementation using balanced trees
with strong memory sharing and "short-cuts" [TJ §6.2]. Experimentation showed
that memory sharing was good with the rough physical equality tests that we
implement, without the need for much more costly techniques such as
hash-consing. Indeed, experiments show that considerable sharing is kept after
the abstract execution of program parts that modify only parts of the global
state (see the Δ-compression in Sect. 3.2). Though simple, this memory-saving
technique is fragile; data sharing must be conserved by all modules in the
program. This obligation had an impact on the design of the communications
between parallel processes.
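The effect of physical-equality short-cuts on the cost of the union can be illustrated with a small persistent tree. This is a sketch only, not Astrée's data structure: it assumes that both arguments derive from a common tree by path-copying updates (hence have the same shape), uses intervals as values, and all names (`Node`, `build`, `update`, `lookup`, `join`, `hull`) are hypothetical. The `visited` list is instrumentation added to count the nodes actually traversed.

```python
class Node:
    """Immutable node of a binary search tree mapping variables to abstract values."""
    __slots__ = ("left", "right", "key", "val")

    def __init__(self, left, right, key, val):
        self.left, self.right, self.key, self.val = left, right, key, val

def build(items):
    """Build a balanced tree from a sorted list of (key, value) pairs."""
    if not items:
        return None
    mid = len(items) // 2
    key, val = items[mid]
    return Node(build(items[:mid]), build(items[mid + 1:]), key, val)

def update(t, key, val):
    """Return a tree with key rebound to val; untouched subtrees are shared with t."""
    if key == t.key:
        return Node(t.left, t.right, key, val)
    if key < t.key:
        return Node(update(t.left, key, val), t.right, t.key, t.val)
    return Node(t.left, update(t.right, key, val), t.key, t.val)

def lookup(t, key):
    """Return the value bound to key (key is assumed to be present)."""
    while t.key != key:
        t = t.left if key < t.key else t.right
    return t.val

def join(a, b, join_val, visited):
    """Join two trees of identical shape, skipping physically equal subtrees."""
    if a is b:                      # short-cut: shared subtree, nothing to do
        return a
    visited.append(a.key)           # instrumentation only
    left = join(a.left, b.left, join_val, visited)
    right = join(a.right, b.right, join_val, visited)
    val = a.val if a.val is b.val else join_val(a.val, b.val)
    if left is a.left and right is a.right and val is a.val:
        return a                    # keep sharing with a when nothing changed
    return Node(left, right, a.key, val)

def hull(x, y):
    """Union of two intervals."""
    return (min(x[0], y[0]), max(x[1], y[1]))
```

With two branches that each modify one variable out of 1024, the join only walks the two root-to-node paths; all other subtrees are recognized as shared by a single pointer comparison.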

4 Memory requirements are smaller on 32-bit systems.

3 Parallelization



In iteration mode, we analyze tests in the following way:
$[\![\texttt{if}(e)\ s_0\ \texttt{else}\ s_1;]\!]^\sharp(d^\sharp) = [\![s_0]\!]^\sharp(\mathit{guard}(e, t, d^\sharp)) \sqcup [\![s_1]\!]^\sharp(\mathit{guard}(e, f, d^\sharp))$.
The analyses of $s_0$ and $s_1$ may be conducted in total separation, in
different threads or processes, or even on different machines. Similarly, the
semantics of an indirect function call may be approximated as:
$[\![\texttt{(*f)();}]\!]^\sharp(a^\sharp) = \bigsqcup_{g \in [\![f]\!](a^\sharp)} [\![g]\!]^\sharp(a^\sharp)$,
where $g$ ranges over all the possible code blocks to which f may point.
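As a rough illustration of this independence, the following Python sketch analyzes each branch on the same input and joins the results afterwards. It is not Astrée's implementation: branches are modelled as plain functions on a single interval, threads from the standard `concurrent.futures` module stand in for the separate processes or machines mentioned above (CPython threads would not speed up CPU-bound analyses), and the names `join` and `analyze_dispatch` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def join(a, b):
    """Abstract union on intervals (lo, hi); None is bottom."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def analyze_dispatch(branches, d, workers=2):
    """Analyze every branch on the same input d, then join the results.

    Each element of branches is a function from an abstract value to an
    abstract value. pool.map returns results in the order of the input list,
    so the final join does not depend on which worker finished first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda branch: branch(d), branches))
    return reduce(join, results, None)
```

Since no branch reads the output of another, the only synchronization point is the final join.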

3.1 Dispatch points 

In usual programs, most tests split the control flow between short sequences 
of execution; the overhead of analyzing such short paths in separate processes 
would therefore be considerable with respect to the length of the analysis itself. 
However, there exist wide classes of programs where a few tests (at "dispatch 
points" ) divide the control flow between long executions. In particular, there 
exist several important kinds of software consisting of a large event loop: the
system waits for an event, then a "dispatcher" runs an appropriate (often fairly 
complex) handler routine: 

Initialization
while true do
    wait for a request r
    dispatch according to the type of r in
        handler for requests of the first type
        handler for requests of the second type
done

This program structure is quite frequent in network services and traditional 
graphical user interfaces (though nowadays often wrapped inside a callback 
mechanism) : 

Initialization
while true do
    wait for an event e
    dispatch according to e in
        event handler 1
        event handler 2
done

Many critical embedded programs are also of this form. We analyze reactive
programs that, for the most part, implement in software a directed graph of
numeric filters. Those numeric filters are in general the discrete-time
counterparts of hardware, continuous-time components, with various sampling
rates. The system is thus made of $n$ components, each clocked with a period
$p_i \cdot p$, where $1/p$ is a master clock rate (say, 1 kHz). It is statically
scheduled as a succession of "sequencers" numbered from 0 to $N - 1$ where $N$
is the least common multiple of the $p_i$. A task of period $p_i \cdot p$ is
scheduled in all sequencers numbered $k \cdot p_i + c_i$. $c_i$ may often be
arbitrarily chosen; judicious choices of $c_i$ allow for static load balancing,
especially with respect to worst-case execution time (all sequencers should
complete within a time of $p$). Thus, the resulting program is of the form:

Initialization
while true do
    wait for clock tick (1/p Hz)
    dispatch according to i in
        sequencer 0
        sequencer 1
    i := i + 1 (mod N)
done

Our analysis is imprecise with respect to the succession of values of $i$; indeed,
it soon approximates $i$ by the full interval $[0, N - 1]$. This is equivalent to saying
that any of the sequencers is nondeterministically chosen. Yet, due to the nature
of the studied system, this is not a hindrance to proving the absence of runtime 
errors: each of the n subcomponents should be individually sound, whatever 
the sampling rate, and the global stability of the system does not rely on the 
sampling rates of the subcomponents, within reasonable bounds. 
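The static schedule described above can be made explicit with a few lines of Python. This is a sketch under the assumption that each offset satisfies 0 <= c_i < p_i, so that "sequencer number k·p_i + c_i" is the same as "sequencer number congruent to c_i modulo p_i"; the function names and the example periods are hypothetical.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def schedule(periods, offsets):
    """Return N (the lcm of the periods) and, per sequencer, the tasks it runs.

    Task i (period p_i, offset c_i with 0 <= c_i < p_i) runs in sequencer s
    exactly when s = k * p_i + c_i for some integer k, i.e. s mod p_i == c_i.
    """
    n_seq = reduce(lcm, periods, 1)
    table = [[i for i, (p, c) in enumerate(zip(periods, offsets)) if s % p == c]
             for s in range(n_seq)]
    return n_seq, table
```

For instance, three tasks of periods 1, 2 and 4 with offsets 0, 1 and 2 give N = 4 sequencers; the offsets spread the two slower tasks over different sequencers, which is the static load balancing mentioned above.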

Our system could actually handle parallelization at any place in the program
where there exist two or more different control flows, by splitting the flows
between several processors or machines; it is however undesirable to fork
processes or launch remote analyses for simple blocks of code. Our current
system decides the splitting point according to a simple ad hoc criterion, but
we could use more universal criteria. For instance, the analyzer, during the
first iteration(s), could measure the analysis times of all branches in if,
switch or multi-aliased function calls; if a control flow choice takes place
between several blocks for which the analysis takes a long time, the analysis
could be split between several machines in the following iterations. To be
efficient, however, such a system would have to do some relatively complex
workload allocation between processors and machines; we will thus only
implement it when really necessary.

3.2 Parallelization implementation 

Instead of analyzing all dispatch branches in sequence, we split the workload 
between p several processors (possibly in different machines). We replace the 
iterative computation of f*(X*) = lPif(X*) U (lP 2 f(X^ U . . . [PJ^X*)) . . . ) 
by a parallel computation / J (X») = ULidJfce^ l p kf( xt )) wh ere the n k are 
a partition of {l,...,n}. Let us note tj the time needed to compute [Pj]". 
Ik = J2jen k T j i s the ti me spent by processor %. 

For maximal efficiency, we would prefer that the U should be close to each 
other, so as to minimize the synchronization waits. Unfortunately, the prob- 
lem of optimally partitioning into the TTk is NP-hard even in the case where 
p = 2 [13]. If the Tj are too diverse, randomly shuffling the list may yield im- 
proved performance. In practice, the real-time programs that we analyze are 
scheduled so that all the Pj have about the same worst-case execution time, so 
as to ensure maximal efficiency of the embedded processor; consequently, the 
Ti are reasonably close together and random shuffling does not bring significant 
improvement; in fact, in can occasionally reduce performances. 
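The quantities involved can be computed with a short Python sketch. The round-robin split used here is an assumption made for the illustration (the text does not specify how the list is cut into the $\pi_k$), indices are 0-based, and the names `partition` and `processor_times` are hypothetical.

```python
import random

def partition(n, p, seed=None):
    """Split job indices 0..n-1 into p lists, round-robin, after an optional shuffle."""
    order = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(order)
    return [order[k::p] for k in range(p)]

def processor_times(taus, parts):
    """t_k = sum of tau_j over the jobs j assigned to processor k."""
    return [sum(taus[j] for j in part) for part in parts]
```

The wall-clock time of one parallel step is `max(processor_times(taus, parts))`, since the join has to wait for the slowest processor; comparing this maximum for the unshuffled and shuffled partitions shows whether shuffling helps on a given list of $\tau_j$.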
Figure 1: Parallelization performance (dual 2.2 GHz Opteron machine + 2 GHz
AMD64 machines).

For large programs of the class we are interested in, the analysis time (Fig. 1)
on $n$ processors is approximately $0.75/n + 0.25$ times the analysis time on
one processor; thus, clusters of more than 3 or 4 processors are of little
interest:





            Prog 1      Prog 2      Prog 3
# lines     67,553      232,859     412,858
1 CPU       26'28"      5h 55'      11h 30'
2 CPU       16'38"      3h 43'      7h 09'
3 CPU       14'47"      2h 58'      5h 50'
4 CPU       13'35"      2h 38'      5h 06'
5 CPU       13'26"      2h 25'      4h 44'
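The approximation $0.75/n + 0.25$ has the same form as Amdahl's law with a sequential fraction of 0.25, which bounds the achievable speedup by 4. The short Python check below compares it with the timings of the two larger programs in the table above, converted to minutes; the names `model`, `measured` and `ratios` are introduced here for the illustration only.

```python
def model(n):
    """Predicted analysis time on n processors, relative to one processor."""
    return 0.75 / n + 0.25

# Analysis times in minutes for 1 to 5 CPUs, taken from the table above.
measured = {
    "Prog 2": [355, 223, 178, 158, 145],
    "Prog 3": [690, 429, 350, 306, 284],
}

def ratios(times):
    """Measured times relative to the single-processor time."""
    return [t / times[0] for t in times]

for name, times in measured.items():
    for n, r in enumerate(ratios(times), start=1):
        print(f"{name}  n={n}  measured={r:.3f}  model={model(n):.3f}")
```

For both programs the measured ratios stay within 0.02 of the model, which is consistent with the remark that little is gained beyond 3 or 4 processors.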



Venet and Brat also have experimented with parallelization [19, §5], with 
similar conclusions; however, the class of programs to be analyzed and the 
expected precisions of their analysis are too different from ours to make direct 
comparisons meaningful. 

Because it is difficult to determine the $\tau_j$ in advance, Astrée features an
optional randomized scheduling strategy, which reduces computation times on
our examples by 5%, with computation times on 2 CPUs about 58% of those on 1.

We reduced transmission costs by sending only the differences between abstract
values at the input and the output: when the remote computation is
$[\![P]\!]^\sharp(d^\sharp)$, only the difference between $d^\sharp$ and
$[\![P]\!]^\sharp(d^\sharp)$ is sent back. This difference is obtained by
physical comparison of data structures, excluding shared subtrees (Sect. 2.2).
The advantage of that method is twofold:

• Experimentally, such "Δ-compression" results in transmissions of about
10% of the full size on our examples. This reduces transmission costs on
networked implementations.

• Recall that we make analysis tractable by sharing data structures (Sect. 2.2).
We however enforce this sharing by simple pointer comparisons (i.e. we do
not construct another copy of a node if our procedure happens to have the
original node at its disposal), which is fast and simple but does not
guarantee optimal sharing. Any data coming from the network, even though
logically equal to some data already present in memory, will be loaded at
a different location; thus, one should avoid merging in redundant data.
Sending only the difference back to the master analyzer thus dramatically
reduces the amount of unshared data created by networked merge operations.
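The idea behind Δ-compression can be sketched on a flat map from variables to abstract values. This is a simplification made for the illustration (Astrée compares trees, not flat dictionaries), it assumes that the input and output have the same set of variables, and the names `delta` and `apply_delta` are hypothetical.

```python
_MISSING = object()

def delta(before, after):
    """Entries of after that are not physically the same object as in before."""
    return {k: v for k, v in after.items() if before.get(k, _MISSING) is not v}

def apply_delta(before, changes):
    """Rebuild the output on the master side.

    Entries absent from changes keep pointing to the objects of before, so
    they remain shared with the master's own data.
    """
    result = dict(before)
    result.update(changes)
    return result
```

The worker computes `delta` locally, where physical equality with its input is still meaningful, and only that dictionary needs to be serialized; the master then calls `apply_delta` on its own copy of the input.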

We require that the $\sqcup$ operator be associative and commutative, so
that $f^\sharp$ does not depend on the chosen partitioning. Such a dependency
would be detrimental for two reasons:

• If the subprograms $P_1, \dots, P_n$ are enclosed within a loop, the
nondeterminism of the abstract transfer function $f^\sharp$ complicates the
analysis of the loop. As we said in Sect. 2.1, we use an "abstract fixpoint"
operator $\mathrm{lfp}^\sharp$ that terminates when it finds $L^\sharp$ such
that $f^\sharp(L^\sharp) \sqsubseteq^\sharp L^\sharp$. Because this check is
performed at least twice, it would be undesirable that the comparisons yield
inconsistent results.

• For debugging and end-user purposes, it is undesirable that the results
of the analysis could vary for the same analyzer and inputs because of
runtime vagaries.

In this case of a loop around the $P_1, \dots, P_n$, we could have alternatively
used asynchronous iterations [5]. To compute $\mathrm{lfp}\, f^\sharp$, one can
use a central repository $X^\sharp$, initially containing $\bot$; then, any
processor $i$ computes
$f_i^\sharp(X^\sharp) = \bigsqcup_{k \in \pi_i} [\![P_k]\!]^\sharp(X^\sharp)$
and replaces $X^\sharp$ with $X^\sharp \mathbin{\nabla} f_i^\sharp(X^\sharp)$.
If the scheduling is fair (no $[\![P_k]\!]$ is ignored indefinitely), such
iterations converge to an approximation of the least fixpoint of
$X \mapsto \bigcup_k [\![P_k]\!](X)$. However, we did not implement our analyzer
this way. Apart from the added complexity, the nondeterminism of the results
was undesirable.

5 For the same reasons, care should be exercised in networked implementations
so that different platforms output the same analysis results on the same
inputs. Subtle problems may occur in that respect; for instance, there may be
differences between floating-point implementations. We use the native
floating-point implementation of the analysis platform; even though all our
host platforms are IEEE-compatible, the exact same analysis code may yield
different results on various platforms, because implementations are allowed to
provide more precision than requested by the standard. For example, the IA32™
(Intel Pentium™) platform under Linux™ (and some other operating systems)
computes by default internally with 80 bits of precision upon values that are
specified to be 64-bit IEEE double precision values. Thus, the result of
computations on that platform may depend on the register scheduling done by the
compiler, and may also differ from results obtained on platforms doing all
computations on 64-bit floating-point numbers (PowerPC™, and even IA32™ and
AMD64™ with some code generation and system settings). Analysis results, in all
cases, would be sound, but they would differ between implementations, which
would be undesirable for the sake of reproducibility and debugging, and also
for parallelization, as explained here. We thus force (if possible) the use of
double (and sometimes single) precision IEEE floating-point numbers in all
computations within the analyzer.

4 Conclusion

We have investigated both theoretical and practical matters regarding the
computation of fixpoints and iteration strategies for static analysis of
single-threaded, block-structured programs, and proposed methods especially
suited for the analysis of large synchronous programs: a denotational iteration
scheme, maximal data sharing between abstract invariants, and parallelization
schemes. On several occasions, we have identified possible extensions of our
system.

Two major problems we have had to deal with were long computation times, 
and, more strikingly, large memory requirements, both owing to the very large 
size of the programs that we consider. Additionally, we had to keep a very high 
precision of analysis over complex numerical properties in order to be able to 
certify the absence of runtime errors in the industrial programs considered. 

We think that several of these methods will apply to other classes of pro- 
grams. Parallelization techniques, perhaps extended, should apply to wide 
classes of event-driven programs; loop iteration techniques should apply to any 
single-threaded programs; data sharing and "union" optimizations should apply 
to any static analyzer. We have also identified various issues of generic interest
with respect to widenings and narrowings.

Acknowledgments: We wish to thank P. Cousot and X. Rival, as well as the
rest of the Astrée team.